Activate FlashHead under vllm serve #2

Open

WilhelmTr wants to merge 1 commit into master from fix/vllm-serve-activation

Conversation

WilhelmTr commented Apr 22, 2026

Summary

  • Add patch_async_llm, targeting AsyncLLM.__init__, so the FlashHead metadata load runs under both the Python LLM(...) API and vllm serve (see the sketch after this list). The existing patch_llm only covers LLMEngine.from_engine_args, which vllm serve never reaches in vLLM 0.19: the OpenAI entrypoint goes through AsyncLLM.from_vllm_config and then AsyncLLM.__init__.
  • Drop the negative-result cache in logits_processor._get_flash_head so a metadata file that appears after server startup is still picked up on the next decode step.
  • Bump to 0.1.10 to trigger the PyPI release.
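
In rough form, the new patch mirrors patch_llm as a wrapper around AsyncLLM.__init__. A minimal sketch; the import path and the metadata derivation are assumptions, and only set_flash_head and the log line are taken from this PR:

```python
import functools
import logging

logger = logging.getLogger(__name__)


def patch_async_llm():
    """Wrap AsyncLLM.__init__ so metadata is prepared on the vllm serve path."""
    # Import path is an assumption; vLLM has moved AsyncLLM between releases.
    from vllm.v1.engine.async_llm import AsyncLLM

    original_init = AsyncLLM.__init__

    @functools.wraps(original_init)
    def patched_init(self, *args, **kwargs):
        original_init(self, *args, **kwargs)
        # Here the real patch derives `metadata` from the engine/model config
        # and calls set_flash_head(metadata), mirroring what patch_llm does
        # on the LLMEngine.from_engine_args path (elided in this sketch).

    AsyncLLM.__init__ = patched_init
    logger.info("[FlashHead] Patched AsyncLLM.__init__")
```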

What went wrong today (repro)

With flash-head==0.1.9 installed against vllm==0.19.1, running

vllm serve embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead \
    --max-model-len 8192 --gpu-memory-utilization 0.75 --max-num-seqs 2

starts up and serves correctly, but /tmp/flashhead_metadata.pt is never written. get_flash_head() returns None, the patched LogitsProcessor._get_logits falls straight through to the original dense path on every decode step, and FlashHead is silently disabled under vllm serve.

Traced through: vllm.entrypoints.openai.api_server.build_async_engine_client_from_engine_args calls AsyncLLM.from_vllm_config, which calls AsyncLLM.__init__. LLMEngine.from_engine_args (the legacy class the current patch targets) is never called. The Python LLM(...) API still works because LLM.__init__ does call LLMEngine.from_engine_args.
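
For reference, the two construction paths side by side (names as traced above, simplified):

```
LLM(...) API (covered by the existing patch_llm):
    LLM.__init__ -> LLMEngine.from_engine_args             # hook fires here

vllm serve (missed until this PR):
    api_server.build_async_engine_client_from_engine_args
        -> AsyncLLM.from_vllm_config -> AsyncLLM.__init__  # new hook here
```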

Verification

Before: no [FlashHead] Loaded lazily... log from either process after startup, no /tmp/flashhead_metadata.pt, dense-head fallback.

After, with this PR:

flash_head.patches.async_llm INFO [FlashHead] Patched AsyncLLM.__init__
flash_head.patches INFO [FlashHead] All patches applied
flash_head INFO [FlashHead] Plugin registered
flash_head.loading INFO [FlashHead] Metadata prepared for lazy loading from flash_head_assets
flash_head.patches.async_llm INFO [FlashHead] Metadata saved for model: embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead
...
flash_head.loading INFO [FlashHead] Loaded lazily on GPU using 'lm_head.weight'

Exact curl from the README returns a coherent detailed video description.
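
(The README's exact command isn't reproduced here; it has the shape of a standard request against vLLM's OpenAI-compatible endpoint, roughly the following with an illustrative prompt:)

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead",
        "messages": [{"role": "user", "content": "Describe the video in detail."}]
      }'
```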

Note (not fixed here)

vLLM's DEFAULT_LOGGING_CONFIG only attaches a handler to the vllm logger, so every [FlashHead] ... INFO line is dropped unless the user sets VLLM_LOGGING_CONFIG_PATH to a config that includes a flash_head logger. Worth either adding a handler inside register() (sketched below), or mentioning in the README that the activation banner won't appear under vllm serve by default.
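
A minimal sketch of the register()-side option, assuming register() is the plugin entry point and that the package logger is named flash_head (follow-up material, not part of this PR):

```python
import logging
import sys


def register():
    """Plugin entry point; also attach a handler so [FlashHead] logs surface."""
    fh_logger = logging.getLogger("flash_head")
    if not fh_logger.handlers:  # avoid duplicate handlers if called twice
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
        fh_logger.addHandler(handler)
        fh_logger.setLevel(logging.INFO)
    # ... existing registration logic (patches, plugin hooks) ...
```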

The commit message:

vLLM 0.19 reaches `AsyncLLM.__init__` through `AsyncLLM.from_vllm_config`
for the OpenAI server, skipping `LLMEngine.from_engine_args`. That left
`set_flash_head(metadata)` uncalled under `vllm serve`, so the patched
`_get_logits` always saw `get_flash_head() is None` and silently fell
back to the dense lm_head on every decode step.

Add a mirror of patch_llm that targets `AsyncLLM.__init__` so the
metadata is written on both paths, and stop caching the None result in
`_get_flash_head` so a late-arriving metadata file is picked up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
WilhelmTr requested a review from JonnaMat on Apr 22, 2026 at 14:17
logger = logging.getLogger(__name__)

# Sentinel for lazy loading
_FLASH_HEAD_NOT_LOADED = object()
Member commented:
This is needed since get_flash_head() may be None (e.g., when running non-FlashHead models).
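
For context, a sketch of how the sentinel and the no-negative-cache behavior fit together; the accessor body is an assumption about the module, not the literal diff:

```python
_flash_head = _FLASH_HEAD_NOT_LOADED


def _get_flash_head():
    """Resolve the FlashHead state, retrying while it is still unset."""
    global _flash_head
    if _flash_head is _FLASH_HEAD_NOT_LOADED:
        loaded = get_flash_head()  # may be None, e.g. non-FlashHead models
        if loaded is None:
            # Don't cache the negative result: a metadata file written after
            # server startup is then picked up on a later decode step.
            return None
        _flash_head = loaded
    return _flash_head
```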

return None


def patch_async_llm():
Member commented:

I think we should add a guard for idempotence, similar to what we do in logits_processor.py (if _flash_head is None: ...).

While AsyncLLM.__init__ runs only once per engine construction (not per decode/request), other parts of vLLM may call it. We could add a _FLASH_HEAD_NOT_LOADED-style sentinel, as sketched below.
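
A sketch of that guard with a module-level flag (the flag name is illustrative, not from the PR):

```python
_ASYNC_LLM_PATCHED = False  # module-level idempotence guard


def patch_async_llm():
    global _ASYNC_LLM_PATCHED
    if _ASYNC_LLM_PATCHED:
        # AsyncLLM.__init__ is already wrapped; don't wrap it twice if some
        # other code path triggers the patch again.
        return
    _ASYNC_LLM_PATCHED = True
    # ... wrap AsyncLLM.__init__ as in the patch above ...
```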
